Authors
Affiliation

Alessandro Pizzi

University of Lausanne

Andrea Lovato

Ayman El Abed

Illia Dorofieiev

Published

December 5, 2024

Abstract

This is the abstract of the report. It should be a short summary of the project, the data, the analysis and the results. It should be concise and to the point. It should not be longer than 250 words.

1 Introduction

1.1 Project Goals

Obesity has emerged as one of the most pressing global health crises, with its prevalence nearly tripling worldwide since 1975, according to the World Health Organization (WHO). This alarming trend has fueled a dramatic rise in obesity-related diseases, including diabetes, cardiovascular conditions, and hypertension, imposing significant burdens on healthcare systems and economies. In Latin America and the Caribbean, the situation is particularly concerning: as of 2022, the Pan American Health Organization (PAHO) reported that nearly 25% of adults in the region are affected by obesity, emphasizing the urgent need for effective public health interventions. The crisis is especially acute in the countries central to this research. In 2018, Mexico recorded an adult obesity rate of 36.1%, while Peru and Colombia reported similarly worrisome rates of approximately 28% and 23%, respectively.

This widespread prevalence underscores the critical need for research focused on understanding and addressing the multifaceted factors contributing to obesity. In this context, the present study adopts an exploratory and primarily educational approach to examine the relationships between dietary habits, physical activity, and demographic variables, aiming to uncover their impact on obesity levels in Mexico, Peru, and Colombia. By leveraging a dataset consisting of 77% synthetically generated data (produced via the SMOTE algorithm) and 23% user-collected data from 498 participants, the research seeks to provide meaningful insights into this complex issue.

While the reliance on synthetic data and a non-representative sample limits direct real-world applicability, this study offers a unique opportunity to apply theoretical knowledge gained during the “Data Science in Business Analytics” course to a simulated scenario. By identifying patterns, correlations, and potential predictors of obesity, the research highlights the importance of data-driven approaches in addressing significant public health challenges. Ultimately, the findings aim to lay the groundwork for future studies and contribute to the development of informed public health strategies and healthcare policies, demonstrating the transformative potential of data analytics in managing and mitigating complex issues.

1.2 Research Questions

  • Question 1

    What are the key lifestyle and behavioral factors that significantly contribute to obesity in Mexico, Peru, and Colombia?

  • Question 2

    Can we predict whether a person will be obese based on some given combinations of factors?

  • Question 3

    How can these insights be effectively leveraged to inform public health initiatives and combat the escalating health crisis?

2 Data

2.1 Sources

The dataset utilized in this project was obtained from the UCI Machine Learning Repository, a reputable and extensively used platform for data science and machine learning projects. Originally compiled by researchers at the Universidad de la Costa, Colombia, the dataset combines 77% synthetically generated data with 23% real-world data collected through a structured online survey. The synthetic data, created using the Synthetic Minority Over-sampling Technique (SMOTE) in Weka, addresses class imbalance, enhancing the dataset’s suitability for machine learning tasks. The real-world data, gathered from 498 participants over a 30-day period, captures detailed self-reported information on dietary habits, physical activity levels, and demographic characteristics. While synthetic data introduces uniformity and balance, it inherently lacks the complexity of real-world variability, and the user-collected data, though authentic, is susceptible to self-reporting biases and sampling limitations. These characteristics, along with the dataset’s diverse origins, make it an invaluable resource for simulating real-world challenges in healthcare analytics.

2.2 Description

The dataset consists of 2111 records and 17 attributes, offering a detailed examination of the factors contributing to obesity. The attributes represent a mix of categorical and continuous variables, providing insights into demographic, lifestyle, and behavioral factors.

The variables include:

  • Gender (Categorical): indicates the gender of the individual (Male/Female).

  • Age (Continuous): represents the age of participants in years.

  • Height (Continuous): the height of individuals in meters.

  • Weight (Continuous): the weight of participants in kilograms.

  • Family History of Overweight (Categorical): indicates whether a family member has suffered from overweight (Yes/No).

  • Frequent Consumption of High-Caloric Food (FAVC) (Categorical): indicates if participants frequently consume high-caloric foods (Yes/No).

  • Frequency of Vegetable Consumption (FCVC) (Continuous): scaled from 1 to 3, reflects how often vegetables are consumed (1 = Never, 3 = Always).

  • Number of Main Meals per Day (NCP) (Continuous): indicates the typical number of main meals consumed daily.

  • Consumption of Food Between Meals (CAEC) (Categorical): describes how often participants eat between meals (e.g., No, Sometimes, Frequently, Always).

  • Smoking (SMOKE) (Categorical): indicates whether participants smoke (Yes/No).

  • Daily Water Consumption (CH2O) (Continuous): scaled from 1 to 3, reflecting daily water intake (1 = Less than 1 liter, 3 = More than 2 liters).

  • Calorie Monitoring (SCC) (Categorical): whether participants monitor their calorie intake (Yes/No).

  • Physical Activity Frequency (FAF) (Continuous): scaled from 0 to 4, indicating days of physical activity per week (0 = None, 4 = 4-5 days).

  • Time Using Technology Devices (TUE) (Continuous): reflects daily time spent on technological devices, in hours.

  • Alcohol Consumption (CALC) (Categorical): indicates the frequency of alcohol consumption (e.g., I don’t drink, Sometimes, Frequently, Always).

  • Transportation Method (MTRANS) (Categorical): describes the primary mode of transportation (e.g., Walking, Public Transportation, Automobile).

  • Obesity Level (NObeyesdad) (Categorical): the target variable, classifying obesity levels into categories such as Normal Weight, Overweight (Levels I and II), and Obesity (Types I, II, III).

The dataset has been pre-processed, with normalization applied to continuous variables and categorical data encoded. SMOTE was used to address class imbalance, with care taken to minimize artificial patterns. The synthetic records (77%) provide class balance, while the real-world records (23%) contribute authenticity. Together they allow a broad analysis of obesity-related factors, subject to potential biases such as self-report inaccuracies.

2.3 Wrangling

Import dataset.

Code
library(here)
library(knitr)
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
kable(head(dataset_raw), format = "markdown", caption = "First 6 Rows of dataset_raw")
First 6 Rows of dataset_raw
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
Female 21 1.62 64.0 yes no 2 3 Sometimes no 2 no 0 1 no Public_Transportation Normal_Weight
Female 21 1.52 56.0 yes no 3 3 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation Normal_Weight
Male 23 1.80 77.0 yes no 2 3 Sometimes no 2 no 2 1 Frequently Public_Transportation Normal_Weight
Male 27 1.80 87.0 no no 3 3 Sometimes no 2 no 2 0 Frequently Walking Overweight_Level_I
Male 22 1.78 89.8 no no 2 1 Sometimes no 2 no 0 0 Sometimes Public_Transportation Overweight_Level_II
Male 29 1.62 53.0 no yes 2 3 Sometimes no 2 no 0 0 Sometimes Automobile Normal_Weight

Load required libraries for data manipulation, visualization, and clustering. Each package serves a specific purpose:

  • dplyr: For data manipulation (e.g., filtering, summarizing).
  • tidyr: For data tidying (e.g., reshaping).
  • ggplot2: For visualization.
  • corrplot: For correlation matrix visualization.
  • ggridges: For creating ridge plots.
  • cluster: For clustering algorithms.
  • reshape2: For data reshaping, especially during visualization.
Code
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(ggridges)
library(cluster)
library(reshape2)

We rename columns for clarity and ease of use in the analysis. The new names are shorter and more intuitive while preserving their original meaning.

Code
dataset <- dataset_raw %>%
  rename(
    family_hist = family_history_with_overweight,
    obesity_lev = NObeyesdad,
    caloric_food = FAVC,
    vegetable_food = FCVC,
    nb_meal_day = NCP,
    food_btw_meals = CAEC,
    ch2o = CH2O,
    smoke = SMOKE,
    calorie_check = SCC,
    physical_act = FAF,
    freq_alcohol = CALC,
    use_tech = TUE,
    m_trans = MTRANS,
    gender = Gender,
    age = Age,
    weight = Weight,
    height = Height
  )

Check for missing values in the dataset by counting the NA values in each column.

Code
missing_values <- colSums(is.na(dataset))
kable(missing_values, format = "markdown", caption = "Missing Values in Each Column")
Missing Values in Each Column
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

All columns contain complete data, with no missing values. If missing data were present, we could address it either by removing the affected rows with dataset <- na.omit(dataset) or by imputing the missing values with an appropriate measure (e.g., the mean or median).
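The two options for handling missing data can be sketched on a small example. This is a minimal illustration on a hypothetical data frame (`toy`, `toy_complete` and `toy_imputed` are made-up names, not objects from the report), and it assumes median imputation is appropriate for the columns involved.

```r
# Hypothetical toy data with two missing values (not the obesity dataset)
toy <- data.frame(
  age    = c(21, NA, 23, 27),
  weight = c(64, 56, NA, 87)
)

# Option 1: drop every row that contains at least one NA
toy_complete <- na.omit(toy)

# Option 2: replace each NA with the median of its column
toy_imputed <- toy
for (col in names(toy_imputed)) {
  col_median <- median(toy_imputed[[col]], na.rm = TRUE)
  toy_imputed[[col]][is.na(toy_imputed[[col]])] <- col_median
}

colSums(is.na(toy))          # NA count per column before handling
nrow(toy_complete)           # rows left after listwise deletion
colSums(is.na(toy_imputed))  # NA count per column after imputation
```

Listwise deletion shrinks the sample, whereas imputation keeps every row but pulls the imputed column toward its center, so the choice depends on how much data is missing.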

Check the structure of the dataset to identify data types for each variable. This helps in identifying columns that need to be converted or standardized.

Code
# Capture the structure of the dataset
str_output <- capture.output(str(dataset))
# Convert the structure output to a data frame
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

kable(str_table, format = "markdown", caption = "Structure of the Dataset")
Structure of the Dataset
Structure
‘data.frame’: 2111 obs. of 17 variables:
$ gender : chr “Female” “Female” “Male” “Male” …
$ age : num 21 21 23 27 22 29 23 22 24 22 …
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 …
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 …
$ family_hist : chr “yes” “yes” “yes” “no” …
$ caloric_food : chr “no” “no” “no” “no” …
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 …
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 …
$ food_btw_meals: chr “Sometimes” “Sometimes” “Sometimes” “Sometimes” …
$ smoke : chr “no” “yes” “no” “no” …
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 …
$ calorie_check : chr “no” “yes” “no” “no” …
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 …
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 …
$ freq_alcohol : chr “no” “Sometimes” “Frequently” “Frequently” …
$ m_trans : chr “Public_Transportation” “Public_Transportation” “Public_Transportation” “Walking” …
$ obesity_lev : chr “Normal_Weight” “Normal_Weight” “Normal_Weight” “Overweight_Level_I” …

We convert specific columns to factors for categorical interpretation during analysis. Factors ensure proper handling of discrete variables in statistical modeling.

We arranged the levels of the obesity categories, food consumption between meals, and the frequency of alcohol use to follow a logical ordinal progression, ensuring these variables accurately reflect increasing severity or frequency for improved interpretability and analysis.

Code
dataset <- dataset %>%
  mutate(
    gender = as.factor(gender),
    family_hist = as.factor(family_hist),
    caloric_food = as.factor(caloric_food),
    smoke = as.factor(smoke),
    calorie_check = as.factor(calorie_check),
    m_trans = as.factor(m_trans),
    obesity_lev = factor(obesity_lev, 
                         levels = c("Insufficient_Weight", "Normal_Weight", 
                                    "Overweight_Level_I", "Overweight_Level_II", 
                                    "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"), 
                         ordered = TRUE),
    food_btw_meals = factor(ifelse(food_btw_meals == "no", "No", food_btw_meals), 
                            levels = c("No", "Sometimes", "Frequently", "Always"), 
                            ordered = TRUE),
    freq_alcohol = factor(ifelse(freq_alcohol == "no", "No", freq_alcohol), 
                          levels = c("No", "Sometimes", "Frequently", "Always"), 
                          ordered = TRUE))

Using str() before and after confirms that each variable has the correct data type, preventing errors during analysis.

Code
str_output <- capture.output(str(dataset))
str_table <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

kable(str_table, format = "markdown", caption = "Structure of the Dataset")
Structure of the Dataset
Structure
‘data.frame’: 2111 obs. of 17 variables:
$ gender : Factor w/ 2 levels “Female”,“Male”: 1 1 2 2 2 2 1 2 2 2 …
$ age : num 21 21 23 27 22 29 23 22 24 22 …
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 …
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 …
$ family_hist : Factor w/ 2 levels “no”,“yes”: 2 2 2 1 1 1 2 1 2 2 …
$ caloric_food : Factor w/ 2 levels “no”,“yes”: 1 1 1 1 1 2 2 1 2 2 …
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 …
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 …
$ food_btw_meals: Ord.factor w/ 4 levels “No”<“Sometimes”<..: 2 2 2 2 2 2 2 2 2 2 …
$ smoke : Factor w/ 2 levels “no”,“yes”: 1 2 1 1 1 1 1 1 1 1 …
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 …
$ calorie_check : Factor w/ 2 levels “no”,“yes”: 1 2 1 1 1 1 1 1 1 1 …
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 …
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 …
$ freq_alcohol : Ord.factor w/ 4 levels “No”<“Sometimes”<..: 1 2 3 3 2 2 2 2 3 1 …
$ m_trans : Factor w/ 5 levels “Automobile”,“Bike”,..: 4 4 4 5 4 1 3 4 4 4 …
$ obesity_lev : Ord.factor w/ 7 levels “Insufficient_Weight”<..: 2 2 2 3 4 2 2 2 2 2 …
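To illustrate why the level order matters, the following sketch compares a default factor with an ordered one. It uses hypothetical responses (`answers`) that share the labels of food_btw_meals; it is an illustration, not part of the report's pipeline.

```r
# Hypothetical responses, using the same labels as food_btw_meals
answers <- c("Sometimes", "No", "Always", "Frequently", "Sometimes")

unordered <- factor(answers)  # levels are sorted alphabetically by default
ordered_f <- factor(answers,
                    levels = c("No", "Sometimes", "Frequently", "Always"),
                    ordered = TRUE)

levels(unordered)  # alphabetical order, so "Always" comes first
levels(ordered_f)  # logical progression from "No" to "Always"

# Order comparisons are only meaningful for ordered factors
ordered_f >= "Frequently"

# Integer codes follow the declared level order
as.integer(ordered_f)
```

Because the integer codes follow the declared order, any later conversion to numeric values preserves the intended progression from "No" to "Always".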

Check for duplicated rows in the dataset.

Code
duplicated_rows <- sum(duplicated(dataset))
duplicated_rows
[1] 24

Keep only one instance of each duplicated row.

Code
dataset <- dataset %>%
  distinct()

Check the number of rows after removing duplicates.

Code
nrow(dataset)
[1] 2087
Code
any(duplicated(dataset))
[1] FALSE
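The row counts above are consistent: 2111 rows minus 24 duplicates leaves 2087. As a minimal sketch of how duplicated() and distinct() interact, here is the same check on a hypothetical four-row data frame (`toy` is a made-up example).

```r
library(dplyr)

# Hypothetical toy data: row 3 repeats row 1 exactly
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male"),
  age    = c(21, 23, 21, 27)
)

# duplicated() flags repeats only, never the first occurrence
n_dup <- sum(duplicated(toy))

# distinct() keeps the first occurrence of each unique row
toy_unique <- distinct(toy)

n_dup
nrow(toy_unique)
any(duplicated(toy_unique))
```

In a dataset that contains interpolated synthetic records, exact duplicates may still be genuine coincidences among survey responses, so removing them is a judgment call rather than a strict requirement.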

In-depth analysis of SMOTE’s impact and visualization of the class distribution

Code
ggplot(dataset, aes(x = obesity_lev)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Class Distribution of Obesity Levels",
    x = "Obesity Level",
    y = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) #Adjusted the text for clarity

Because the dataset already includes the SMOTE-generated records, the distribution is well balanced across all categories, with each class showing a similar count. This outcome reflects SMOTE’s intended effect of addressing class imbalance.
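For readers unfamiliar with SMOTE, its core idea is to create a synthetic point on the line segment between a minority-class observation and one of its nearest neighbors. The sketch below illustrates that interpolation step for numeric features only. The function `smote_sketch` and the data frame `minority` are hypothetical, and this is not the Weka implementation used by the dataset's authors.

```r
# Minimal SMOTE-style interpolation for numeric features (illustrative only;
# not the Weka implementation used to build the dataset)
smote_sketch <- function(X, n_new, k = 3) {
  X <- as.matrix(X)
  D <- as.matrix(dist(X))  # pairwise Euclidean distances
  diag(D) <- Inf           # a point is not its own neighbor
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                      dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base_idx  <- sample(nrow(X), 1)                # random minority-class point
    neighbors <- order(D[base_idx, ])[seq_len(k)]  # its k nearest neighbors
    nn_idx    <- neighbors[sample.int(k, 1)]       # pick one neighbor at random
    u <- runif(1)                                  # position along the segment
    synthetic[i, ] <- X[base_idx, ] + u * (X[nn_idx, ] - X[base_idx, ])
  }
  synthetic
}

set.seed(1)
# Hypothetical minority class with two numeric features
minority <- data.frame(age    = c(19, 20, 22, 24, 21),
                       weight = c(48, 50, 52, 55, 49))
new_points <- smote_sketch(minority, n_new = 10)
head(new_points)
```

Since every synthetic point is a convex combination of two existing points, it stays inside the range of the original class and generally takes non-integer values, which is one way interpolated records can differ from survey responses.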

Distribution analysis

Density plot for age.

Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
    labs(
    title = "Age Distribution by Obesity Levels",
    x = "Age",
    y = "Density",
    fill = "Obesity Level") +
  xlim(14, 50) # Limit the x-axis to 14–50; ages above 50 are dropped before the density is estimated

This graph allows us to assess the age distribution across obesity levels and to evaluate the impact of the SMOTE algorithm in generating synthetic data. Two key takeaways emerge: first, the distributions show a clear separation between obesity categories, particularly with younger ages dominating in lower obesity levels (e.g., Insufficient Weight and Normal Weight) and older ages appearing more prominently in higher obesity levels (e.g., Obesity Type II and III). Second, sharp peaks, such as the one around age 30 in “Obesity Type I,” could signal potential artifacts from data synthesis. While these patterns indicate that the dataset maintains logical trends, further validation is necessary to confirm that these separations and peaks reflect realistic population characteristics and not artificial biases introduced during data augmentation. Overall, the dataset appears well-structured, but these observations warrant careful consideration during analysis.

Summary statistics by obesity level.

Code
dataset_stat <- dataset %>%
  group_by(obesity_lev) %>%
  summarize(
    Age_Mean = mean(age, na.rm = TRUE),
    Age_SD = sd(age, na.rm = TRUE),
    Height_Mean = mean(height, na.rm = TRUE),
    Height_SD = sd(height, na.rm = TRUE),
    Weight_Mean = mean(weight, na.rm = TRUE),
    Weight_SD = sd(weight, na.rm = TRUE)
  )
kable(dataset_stat,format = "markdown",caption = "Summary statistics by obesity level",digits = 1)
Summary statistics by obesity level
obesity_lev Age_Mean Age_SD Height_Mean Height_SD Weight_Mean Weight_SD
Insufficient_Weight 19.8 2.7 1.7 0.1 50.0 6.0
Normal_Weight 21.8 5.1 1.7 0.1 62.2 9.3
Overweight_Level_I 23.5 6.3 1.7 0.1 74.5 8.6
Overweight_Level_II 27.0 8.1 1.7 0.1 82.1 8.5
Obesity_Type_I 25.9 7.8 1.7 0.1 92.9 11.5
Obesity_Type_II 28.2 4.9 1.8 0.1 115.3 8.0
Obesity_Type_III 23.5 2.8 1.7 0.1 120.9 15.5

The summary statistics show that height is nearly constant across obesity levels (means of 1.7–1.8 m, standard deviation 0.1 m) and that mean age varies only moderately (19.8 to 28.2 years). Weight, by contrast, rises steadily from a mean of 50.0 kg for Insufficient Weight to 120.9 kg for Obesity Type III, as expected for a variable so closely tied to the obesity classification. Interpretation: within each class the standard deviations remain moderate and no class shows implausible values, which suggests that SMOTE balanced the classes without introducing extreme observations. The table contains no pre-SMOTE baseline, so it cannot by itself show that the original distributions were left unchanged.

Perform K-means clustering and calculate silhouette score.

Code
library(cluster)
set.seed(123)
kmeans_res <- kmeans(select(dataset, where(is.numeric)), centers = length(unique(dataset$obesity_lev)))
silhouette_score <- silhouette(kmeans_res$cluster, dist(select(dataset, where(is.numeric))))
mean_silhouette_score <- mean(silhouette_score[, "sil_width"])
mean_silhouette_score
[1] 0.4513519

Silhouette score from K-means clustering: the mean silhouette score of approximately 0.451 suggests moderate cohesion within clusters and some separation between them. The seven clusters (one per obesity level) are therefore neither sharply distinct nor fully blended. Note that the clustering uses the unscaled numeric variables, so the Euclidean distances are dominated by the variable with the largest spread, namely weight. Interpretation: a score close to 0.5 generally reflects reasonable separability without excessive artificial separation. This is consistent with SMOTE having produced distinguishable but not overly isolated groups, although the score alone cannot confirm that the clusters coincide with the obesity labels. We tentatively conclude that SMOTE has balanced the dataset without drastically distorting it.
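Since K-means relies on Euclidean distance, the scale of each variable affects the result. The following sketch compares the mean silhouette score with and without standardization on hypothetical two-variable data. The helper `mean_silhouette` and the data frame `toy` are assumptions made for illustration, and the resulting values say nothing about the report's dataset.

```r
library(cluster)

set.seed(123)
# Hypothetical toy data: two numeric variables on very different scales
toy <- data.frame(
  height = c(rnorm(30, 1.60, 0.05), rnorm(30, 1.80, 0.05)),
  weight = c(rnorm(30, 60, 8),      rnorm(30, 95, 8))
)

# Mean silhouette width for a k-means partition of X
mean_silhouette <- function(X, k) {
  km  <- kmeans(X, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, dist(X))
  mean(sil[, "sil_width"])
}

sil_raw    <- mean_silhouette(toy, k = 2)         # distances dominated by weight
sil_scaled <- mean_silhouette(scale(toy), k = 2)  # each variable has unit variance
c(raw = sil_raw, scaled = sil_scaled)
```

Running the report's clustering on scale()-d variables would be one way to check whether the silhouette score of about 0.451 mainly reflects the weight variable.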

Creating a Numerical Dataset “dataset_num”.

Code
# Recode factor levels into numeric codes. All recodes are chained in a single
# pipeline so that no step overwrites the result of a previous one.
dataset_num <- dataset %>%
  mutate(
    obesity_lev = recode(obesity_lev,
                         "Insufficient_Weight" = 1,
                         "Normal_Weight" = 2,
                         "Overweight_Level_I" = 3,
                         "Overweight_Level_II" = 4,
                         "Obesity_Type_I" = 5,
                         "Obesity_Type_II" = 6,
                         "Obesity_Type_III" = 7),
    freq_alcohol = recode(freq_alcohol,
                          "No" = 1,
                          "Sometimes" = 2,
                          "Frequently" = 3,
                          "Always" = 4),
    # m_trans is nominal: the numeric codes are labels, not a ranking
    m_trans = recode(m_trans,
                     "Automobile" = 1,
                     "Bike" = 2,
                     "Motorbike" = 3,
                     "Public_Transportation" = 4,
                     "Walking" = 5),
    food_btw_meals = recode(food_btw_meals,
                            "No" = 0,
                            "Sometimes" = 1,
                            "Frequently" = 2,
                            "Always" = 3),
    calorie_check = recode(calorie_check,
                           "no" = 0,
                           "yes" = 1)
  ) %>%
  # Remaining factors (gender, family_hist, caloric_food, smoke) become codes 1/2
  mutate(across(where(is.factor), ~ as.numeric(.)))


str_output <- capture.output(str(dataset_num))
table_num_str <- data.frame(Structure = str_output, stringsAsFactors = FALSE)

kable(table_num_str,format =  "markdown", caption = "structure of the numerical dataset")
structure of the numerical dataset
Structure
‘data.frame’: 2087 obs. of 17 variables:
$ gender : num 1 1 2 2 2 2 1 2 2 2 …
$ age : num 21 21 23 27 22 29 23 22 24 22 …
$ height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 …
$ weight : num 64 56 77 87 89.8 53 55 53 64 68 …
$ family_hist : num 2 2 2 1 1 1 2 1 2 2 …
$ caloric_food : num 1 1 1 1 1 2 2 1 2 2 …
$ vegetable_food: num 2 3 2 3 2 2 3 2 3 2 …
$ nb_meal_day : num 3 3 3 3 1 3 3 3 3 3 …
$ food_btw_meals: num 1 1 1 1 1 1 1 1 1 1 …
$ smoke : num 1 2 1 1 1 1 1 1 1 1 …
$ ch2o : num 2 3 2 2 2 2 2 2 2 2 …
$ calorie_check : num 0 1 0 0 0 0 0 0 0 0 …
$ physical_act : num 0 3 2 2 0 0 1 3 1 1 …
$ use_tech : num 1 0 1 0 0 0 0 0 1 1 …
$ freq_alcohol : num 1 2 3 3 2 2 2 2 3 1 …
$ m_trans : num 4 4 4 5 4 1 3 4 4 4 …
$ obesity_lev : num 2 2 2 3 4 2 2 2 2 2 …

2.3.1 Spotting Mistakes and Missing Data

We verified the presence of any potential NA values that might have arisen during the conversion of categorical variables to numeric format.

Code
nb_na<- colSums(is.na(dataset_num))
kable(nb_na, format = "markdown",caption = "Presence of potential NA values in the dataset")
Presence of potential NA values in the dataset
x
gender 0
age 0
height 0
weight 0
family_hist 0
caloric_food 0
vegetable_food 0
nb_meal_day 0
food_btw_meals 0
smoke 0
ch2o 0
calorie_check 0
physical_act 0
use_tech 0
freq_alcohol 0
m_trans 0
obesity_lev 0

The results of the test confirmed that there are no NA values in the dataset, indicating that all variables were successfully converted to numeric format while retaining their integrity.
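To show how NA values can appear during such a conversion, the sketch below maps labels to numbers with a named lookup vector and deliberately omits one label. The vectors `answers` and `mapping` are hypothetical, and the lookup-vector approach is a base R stand-in for the recoding step, not the code used in the report.

```r
# Hypothetical responses; "maybe" is deliberately missing from the mapping
answers <- c("no", "yes", "maybe", "no")
mapping <- c(no = 0, yes = 1)  # named lookup vector: label -> numeric code

# Labels without an entry in the mapping come back as NA
recoded <- unname(mapping[answers])
recoded

# Same check as in the report: count NA values per column
toy_num <- data.frame(calorie_check = recoded)
na_counts <- colSums(is.na(toy_num))
na_counts

# Report which labels were not covered by the mapping
uncovered <- setdiff(unique(answers), names(mapping))
uncovered
```

A zero NA count after conversion, as obtained above for dataset_num, therefore indicates that every label was covered by its mapping.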

2.3.2 Listing Anomalies and Outliers

2.4 Correlation Analysis

To identify factors that may influence obesity level, we computed a correlation matrix of the numeric variables, focusing on their associations with obesity_lev. Variables were ordered by their correlation with obesity_lev, from the most positive to the most negative. A heatmap was generated using a diverging color gradient to visualize these correlations, with red indicating positive relationships, blue negative ones, and white weak or neutral ones. Numerical labels and rotated axis labels were added to improve interpretability. The first heatmap includes all 17 variables, while the second omits m_trans, smoke, and nb_meal_day.

Code
#Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>%
                    select("physical_act", "freq_alcohol", "obesity_lev", "age",
                           "weight","height", "family_hist", "caloric_food",
                           "vegetable_food", "food_btw_meals", "use_tech", "ch2o",
                           "m_trans", "smoke","nb_meal_day", "calorie_check",
                           "gender"),
                  use = "complete.obs")

#Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

#Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

#Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

#Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        # Rotate x-axis labels for readability
        axis.text.y = element_text(angle = 45, vjust = 1) 
        # Rotate y-axis labels for readability
  )

Code
# Create the heatmap with correlation values

# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(dataset_num %>%
                    select("physical_act", "freq_alcohol", "obesity_lev", "age",
                           "weight", "family_hist", "caloric_food",
                           "vegetable_food", "food_btw_meals",
                           "use_tech","ch2o", "height",
                           "calorie_check", "gender"),
                  use = "complete.obs")

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev",]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)


ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5
            , hjust = 0.5) + # Center text within tiles
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y
       = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        # Rotate x-axis labels for readability
        axis.text.y = element_text(angle = 45, vjust = 1) 
        # Rotate y-axis labels for readability
  )
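The ordering step used for the heatmaps can be sketched on simulated data. In the example below, `toy`, `x_pos`, `x_neg`, `x_noise` and `target` are hypothetical names, with `target` playing the role of obesity_lev. The Spearman matrix is included as an assumption-labeled complement, since several of the report's variables are ordinal codes for which a rank-based measure may be a reasonable cross-check.

```r
set.seed(42)
n <- 200
# Hypothetical numeric data: 'target' plays the role of obesity_lev
toy <- data.frame(x_pos = rnorm(n), x_neg = rnorm(n), x_noise = rnorm(n))
toy$target <- 2 * toy$x_pos - 1 * toy$x_neg + rnorm(n)

# Pearson (as in the report) and Spearman (rank-based) correlation matrices
cor_pearson  <- cor(toy, use = "complete.obs")
cor_spearman <- cor(toy, method = "spearman", use = "complete.obs")

# Order variables by signed correlation with the target, most positive first
ordered_vars <- names(sort(cor_pearson["target", ], decreasing = TRUE))
ordered_vars

# Reordered matrices, ready to be melted and plotted as a heatmap
cor_pearson[ordered_vars, ordered_vars]
cor_spearman[ordered_vars, ordered_vars]
```

Sorting by the signed value places negatively correlated variables last. Sorting by abs() instead would rank variables by strength regardless of direction.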

3 Exploratory Data Analysis (EDA)

3.1 Descriptive statistics and distribution analysis

3.1.1 Age

Descriptive statistics for Age.

Code
summary(dataset$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.00   19.92   22.85   24.35   26.00   61.00 
Code
sd(dataset$age, na.rm = TRUE)
[1] 6.368801

Age distribution

The age data shows a right-skewed distribution, with a mean of 24.35 years exceeding the median of 22.85 years. The range (14 to 61 years) covers a wide age span, but most individuals are concentrated in the 20–30 age range: the middle half of the sample lies between 19.92 and 26.00 years. The standard deviation (6.37 years) suggests moderate variability in the dataset. This young population distribution may limit the applicability of results to older age groups, where obesity risk factors could differ.
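The right skew can also be quantified rather than read off the mean and median alone. The sketch below computes a simple moment-based skewness on simulated ages. The vector `toy_age` and the helper `skewness` are hypothetical, and the gamma distribution is merely a convenient way to generate a right-skewed sample.

```r
# Hypothetical right-skewed ages (not the report's data)
set.seed(7)
toy_age <- 14 + rgamma(500, shape = 2, scale = 5)

# Moment-based sample skewness: mean cubed z-score (positive = right-skewed)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

age_summary <- c(mean = mean(toy_age),
                 median = median(toy_age),
                 skewness = skewness(toy_age))
age_summary
```

Applied to dataset$age, a positive value would agree with the mean of 24.35 exceeding the median of 22.85.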

Age Distribution by Obesity Level (Violin Plot)

The age distribution varies across obesity levels, highlighting distinct trends. Insufficient and normal weight categories are concentrated among younger individuals (14–30), while overweight and obesity levels shift towards mid-adulthood (20–40), peaking around 30–35 years. Obesity Type III is an exception: according to the summary statistics by obesity level, its mean age is 23.5 years with a standard deviation of only 2.8 years, so it is concentrated among young adults rather than in the 30–40 range. These patterns suggest that weight issues tend to progress with age and emphasize the need for targeted interventions during early to mid-adulthood to prevent worsening obesity levels.

Code
ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1))

The violin plot shows more clearly that individuals in the lower obesity categories, such as insufficient and normal weight, are predominantly younger, with ages concentrated between 14 and 30 years. In contrast, higher obesity levels exhibit a broader age range, with a peak density observed around 30–40 years, particularly in Obesity Type I and Type II. Obesity Type III does not follow this pattern: its ages fall in a narrow band in the early to mid-20s (mean 23.5, standard deviation 2.8 in the summary table). Overall, the visualization points to a gradual increase of obesity risk with age across most categories and emphasizes the need for early intervention strategies, particularly during early and mid-adulthood when such risks become more pronounced.

Obesity level against age, with a LOESS smooth trend line.

Code
ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()

The graph shows a smooth trend line capturing the overall pattern. Obesity levels increase significantly from adolescence to early adulthood, peaking around the 25–30 years age range. This period potentially represents a critical transition, where lifestyle factors such as reduced physical activity, higher caloric intake, and metabolic changes can contribute to the steep rise in obesity levels.

Beyond the peak, the trend shows a gradual decline in obesity levels after 30 years, which may reflect behavioral changes, such as increased health awareness, dietary improvements, or a selection bias in older age groups. This switch suggests that mid-20s to early-30s is a pivotal stage for interventions aimed at mitigating obesity risk.
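The age at which the smoothed trend peaks can be read from a fitted curve instead of being estimated by eye. The sketch below fits base R's loess() to simulated data and locates the maximum of the smoothed values. The data frame `toy`, its column `level` and the grid are hypothetical, and the ordinal level is treated as a number, as in the plot above.

```r
set.seed(99)
# Hypothetical data: the level rises with age up to about 30, then declines
toy <- data.frame(age = runif(300, 14, 60))
toy$level <- 4 - ((toy$age - 30) / 15)^2 + rnorm(300, sd = 0.5)

# Local regression; span = 0.75 is the default of loess()
fit <- loess(level ~ age, data = toy, span = 0.75)

# Evaluate the smoothed curve on a regular age grid and locate its maximum
grid <- data.frame(age = seq(15, 59, by = 1))
grid$smoothed <- predict(fit, newdata = grid)
peak_age <- grid$age[which.max(grid$smoothed)]
peak_age  # age at which the fitted trend peaks
```

The estimated peak depends on the span: a smaller span follows local fluctuations more closely, while a larger one yields a flatter curve.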

3.1.2 Height

Descriptive statistics for Height.

Code
summary(dataset$height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.450   1.630   1.702   1.703   1.769   1.980 
Code
sd(dataset$height, na.rm = TRUE)
[1] 0.09318594

Height distribution.

Code
ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()

The height histogram shows the distribution of height (in meters), which is approximately normal with at most a slight right skew. All values fall between 1.45 m and 1.98 m, with the most frequent heights around 1.8 m. The range is realistic, with no visible extreme outliers, and the standard deviation (0.09 m) indicates low variability. The mean and median are both approximately 1.70 m, which points to a nearly symmetrical distribution.

Height by Obesity Level

Violin plot of Height by Obesity Level.

Code
ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_violin(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

The plot shows relatively low variability in height within each category, with overlapping ranges between most groups. Individuals with Insufficient Weight and Normal Weight have slightly narrower distributions, centered around similar heights (~1.7 m). As obesity levels increase (e.g., Obesity Type I–III), the distributions remain consistent, suggesting that height is not strongly associated with obesity classification. Weight therefore appears to be more informative than height alone for determining obesity level.
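One way to make the relative influence of weight and height concrete is the body mass index (BMI), defined as weight in kilograms divided by squared height in meters. The report does not state how the obesity labels were derived, so using BMI here is an assumption made purely for illustration. The means and standard deviations are taken from the summary outputs in this report.

```r
# Body mass index: weight in kilograms divided by squared height in meters
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Sample means and standard deviations from the report's summary outputs
w_mean <- 86.86; w_sd <- 26.19
h_mean <- 1.703; h_sd <- 0.0932

# Change in BMI when one variable moves by one standard deviation
bmi_base     <- bmi(w_mean, h_mean)
delta_weight <- bmi(w_mean + w_sd, h_mean) - bmi_base
delta_height <- bmi(w_mean, h_mean + h_sd) - bmi_base

c(base = bmi_base,
  weight_plus_1sd = delta_weight,
  height_plus_1sd = delta_height)
```

Under this assumption, a one-standard-deviation change in weight shifts BMI considerably more than a one-standard-deviation change in height, in line with the weak association between height and obesity level seen in the plot.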

3.0.0.1.3 Weight

Descriptive statistics for weight.

Code
summary(dataset$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  39.00   66.00   83.10   86.86  108.02  173.00 
Code
sd(dataset$weight, na.rm = TRUE)
[1] 26.19085

Weight by gender

Density plot for weight distribution by gender.

Code
ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()

The density plot reveals distinct weight distributions between genders. Females generally weigh less, with a peak around 70 kg, while males show peaks around 85 kg and 115 kg, indicating a tendency toward higher weights. The overlapping region around 80–90 kg shows weights common to both genders, but the distinct density peaks emphasize gender-based differences in weight distribution. Overall, males dominate at the higher ranges. Weight ranges from 39 kg to 173 kg, with a mean of 86.9 kg, a median of 83.1 kg, and a standard deviation of 26.2 kg, indicating a moderate spread.

Weight by obesity level

Ridgeline Plot of Weight by Obesity Level.

Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")

This ridgeline plot shows a clear progression in weight distribution across obesity levels. As the obesity level increases, the weight distribution shifts progressively to higher ranges. The “Normal Weight” and “Insufficient Weight” categories are concentrated at lower weights, while the higher obesity types (I, II, and III) peak at significantly greater weights, indicating a strong positive association between weight and obesity level. Overall, the weight distribution has a mean of 86.9 kg and a standard deviation of 26.2 kg.

3.0.0.1.4 Height and Weight

Scatter Plot (height vs weight), colored by obesity level.

Code
ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +  # Adds a trend line for each obesity level
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")

Facet Grid for Height and Weight by Obesity Level.

Code
ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")

The scatter plot with trend lines for each obesity level reveals a clear positive correlation between weight and height across all obesity levels. As the obesity level increases, the slope generally becomes steeper, indicating a stronger weight gain relative to height. The facet grid shows these trends more clearly by plotting each obesity level in its own panel. The “Obesity_Type_III” (yellow) category has the steepest slope, suggesting a significant weight increase per unit of height, which is consistent with the highest level of obesity.

Correlation between height and weight.

Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468

The correlation observed between height and weight (r ≈ 0.46) aligns with existing literature, confirming the expected positive relationship between these variables.

3.0.0.1.5 Food between meals
Code
# Dodged Bar Chart for food_btw_meals by obesity levels
ggplot(dataset, aes(x = food_btw_meals, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Food Between Meals by Obesity Levels") +
   labs(x = "Food Between Meals", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14))

Code
# Stacked Bar Chart of Food Between Meals by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = food_btw_meals)) +
    geom_bar(position = "fill") + # Stacked bar chart with proportions
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # Format y-axis as percentages
    ggtitle("Proportion of Food Between Meals Across Obesity Levels") + # Shortened and clear title
    labs(x = "Obesity Levels", y = "Proportion (%)", fill = "Food Between Meals") + # Correct axis and legend labels
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis text for readability
        plot.title = element_text(hjust = 0.5, size = 14) # Center and style the title
    )

These charts provide a clear illustration of how the frequency of eating between meals varies across obesity levels. The most dominant behavior across all categories is “Sometimes,” which peaks in intermediate levels like Normal Weight and Overweight Level I, reflecting a common pattern of moderate snacking. However, as obesity levels increase to Obesity Types I–III, the responses for “Frequently” and “Always” diminish, while “Sometimes” becomes even more prevalent. This shift could indicate that higher obesity levels are more associated with habitual moderate snacking rather than excessive meal-snacking frequency. On the other hand, “No” responses remain negligible across all obesity levels, suggesting that eating between meals is almost universal in this population. This pattern underscores the importance of examining not just the frequency but also the quality and context of snacking as potential contributors to obesity progression.

3.0.0.1.6 High-caloric food consumption
Code
# Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels
ggplot(dataset, aes(x = caloric_food, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for High-Caloric Food Consumption by Obesity Levels") +
   labs(x = "High-Caloric Food Consumption", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

The dodged bar chart clearly shows that the majority of individuals, especially in the higher obesity categories (Obesity Type I–III), report consuming high-caloric foods (“yes”). This trend becomes increasingly pronounced as obesity levels rise, with very few individuals reporting “no” consumption in these categories. In contrast, lower obesity levels (e.g., Normal Weight, Overweight Level I) show a slightly higher representation of “no” responses, indicating a potential shift in dietary habits across obesity levels.

Code
# Grouped Bar Chart of High-Caloric Food by Obesity Level (Proportions within each Obesity Level)
ggplot(dataset, aes(x = obesity_lev, fill = caloric_food)) +
  # after_stat() replaces the dot-dot notation (..count..) deprecated in ggplot2 3.4.0;
  # each bar's count is divided by the total count of its obesity level
  geom_bar(position = "dodge", aes(y = after_stat(count / tapply(count, x, sum)[x])), color = "black") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  ggtitle("Grouped Bar Chart of High-Caloric Food Consumption Across Obesity Levels") +
  labs(x = "Obesity Levels", y = "Proportion (%)", fill = "High-Caloric Food Consumption") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14)
  )

The grouped bar chart effectively shows the behavioral shift toward higher high-caloric food consumption as obesity levels increase. High-caloric food consumption (“yes”) consistently accounts for over 75% of responses, becoming nearly universal in higher obesity categories (Obesity Type I–III). In contrast, “no” responses are more visible in lower obesity levels, such as Insufficient Weight and Normal Weight, but remain a minority.
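The within-level proportions described here can also be tabulated directly in base R. The following is a minimal sketch on a small hypothetical data frame (`toy` is illustrative only and is not the report's dataset); `prop.table()` with `margin = 1` normalizes each row, so every obesity level sums to 100%.

```r
# Hypothetical toy data, NOT the report's dataset
toy <- data.frame(
  obesity_lev  = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                   "Obesity_Type_I", "Obesity_Type_I"),
  caloric_food = c("yes", "no", "yes", "yes", "yes")
)

# Cross-tabulate counts: rows = obesity level, columns = response
counts <- table(toy$obesity_lev, toy$caloric_food)

# margin = 1 divides each cell by its row total,
# giving the share of "yes"/"no" within each obesity level
props <- prop.table(counts, margin = 1)

round(props, 3)
```

Normalizing by row rather than by the grand total is what makes levels of different sizes comparable.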

Code
percentage_high_caloric_consumers <- mean(dataset$caloric_food == "yes") * 100
percentage_high_caloric_consumers
[1] 88.35649

More precisely, a notable 88.4% of participants report frequent consumption of high-calorie foods, which may directly contribute to weight gain, highlighting the need for dietary interventions focused on reducing high-calorie intake.

3.0.0.1.7 Alcohol consumption

Frequency of alcohol consumption.

Code
# Filter out "Always" responses from the dataset
filtered_dataset <- dataset %>%
  filter(freq_alcohol != "Always")

# Dodged Bar Chart for freq_alcohol by Obesity Levels (excluding "Always")
ggplot(filtered_dataset, aes(x = freq_alcohol, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Alcohol Consumption by Obesity Levels") +
   labs(x = "Alcohol Consumption Frequency", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

The chart shows that “Sometimes” is the dominant alcohol consumption frequency across all obesity levels, particularly in Normal Weight, Overweight Level I, and II categories. As obesity increases, “Frequently” becomes slightly more prominent, especially in Obesity Type III, while “No” responses decrease, being more common in lower obesity levels such as Insufficient and Normal Weight. The “Always” responses are excluded from this chart due to their near absence in the dataset, highlighting that excessive alcohol consumption is rare. This trend underlines the potential relationship between moderate-to-frequent alcohol consumption and higher obesity levels, emphasizing its importance for obesity-related behavioral research.

Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(
    total = sum(count),
    proportion = count / total
  ) %>%
  ungroup()

# Visualization with updated title
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +  # Format y-axis as percentages
  ggtitle("Proportion of 'Sometimes' and 'No' Alcohol Responses by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion (%)", color = "Alcohol Frequency") +
  scale_color_manual(values = c("No" = "purple", "Sometimes" = "gold")) + # Improved color scheme
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 14),  # Center and style title
    legend.position = "top"
  )

The proportion of individuals who drink alcohol “Sometimes” increases with higher obesity levels, peaking in Obesity_Type_III. In contrast, the likelihood of abstaining from alcohol (“no”) decreases as obesity levels rise. This pattern suggests that moderate alcohol consumption may be associated with higher obesity levels, while abstention is more common among those with lower obesity levels.

A possible interaction to investigate later is between alcohol frequency and caloric food preference, as both behaviors seem linked to higher obesity levels. Exploring this could reveal whether a preference for caloric foods combined with moderate alcohol consumption has a compounding effect on obesity risk. This investigation could help clarify whether combined lifestyle factors contribute more to higher obesity levels than each factor alone.

Daily calorie monitoring.

Code
# Dodged Bar Chart for calorie_check by Obesity Levels
ggplot(dataset, aes(x = calorie_check, fill = obesity_lev)) +
   geom_bar(position = "dodge", color = "black") +
   ggtitle("Dodged Bar Chart for Calorie Monitoring by Obesity Levels") +
   labs(x = "Calorie Monitoring", y = "Count", fill = "Obesity Levels") +
   theme_minimal() +
   theme(
         plot.title = element_text(hjust = 0.5, size = 14)) # Center and style the title

Code
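data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%  # compute proportions within each obesity level, not over the whole dataset
  mutate(total = sum(count), proportion = count / total) %>%
  ungroup()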

ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces the size aesthetic deprecated for lines in ggplot2 3.4.0
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level", y = "Proportion", color = "Calorie Check") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

The Dodged Bar Chart highlights two main trends regarding calorie-checking behavior across obesity levels: a significant increase in “Yes” responses as obesity levels rise, particularly from Overweight Level II onward, and a decrease in “No” responses, which are more prevalent in lower obesity levels like Normal Weight and Insufficient Weight. The second graph simplifies these trends by clearly illustrating the proportional shift between “Yes” and “No” responses, making the contrast between lower and higher obesity levels more visually apparent. Together, these visualizations emphasize a potential association between obesity severity and an increased tendency to check calorie intake, suggesting heightened dietary awareness in higher obesity categories.

3.0.0.1.8 Vegetable consumption
Code
ggplot(dataset, aes(x = vegetable_food)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", color = "black", alpha = 0.6) +
  geom_density(color = "darkgreen", linewidth = 1) +
  ggtitle("Histogram and Density of Vegetable Food Consumption") +
  theme_minimal() +
  labs(x = "Vegetable Food Consumption", y = "Density")

Code
ggplot(dataset, aes(x = weight, y = vegetable_food, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Vegetable Food Consumption", 
         x = "Weight", 
         y = "Vegetable Food Consumption") +
    theme_minimal() +
    coord_cartesian(xlim= c(40, 135), ylim= c(2, 3))

The scatterplot provided with the trend line illustrates a distinct, non-linear relationship: vegetable consumption initially decreases as weight increases but then begins to rise again at higher weight levels.

This pattern suggests that individuals with lower weight, particularly those in the Insufficient Weight and Normal Weight categories, tend to report higher vegetable consumption. As weight progresses toward the Overweight categories, vegetable consumption decreases slightly, indicating a possible reduction in healthy dietary habits. However, at the upper end of the weight spectrum, corresponding to Obesity Type II and Obesity Type III, vegetable consumption increases again, potentially due to dietary interventions or awareness in this group.

The trend reveals two possible key insights:

  • A dip in vegetable consumption occurs in intermediate weight ranges, aligning with the overweight population.
  • The sharp increase in vegetable consumption among the most obese individuals may reflect lifestyle adjustments prompted by health concerns or medical advice.
3.0.0.1.9 Physical activity

Plot histogram and density.

Code
ggplot(dataset, aes(x = physical_act)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")

The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these are the commonly reported levels. Intermediate values are also present but less frequent; they are likely a by-product of the synthetic records generated by SMOTE, which interpolates between observed values.
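To illustrate why SMOTE can produce non-integer values for a variable originally reported on an integer scale, here is a minimal sketch of the interpolation step only (not the full algorithm, and not the procedure used to build this dataset). The two observed values `x_i` and `x_nn` are hypothetical.

```r
# Minimal illustration of SMOTE-style interpolation (interpolation step only).
set.seed(42)

x_i  <- 1  # hypothetical physical_act value of an observed respondent
x_nn <- 2  # hypothetical physical_act value of one of its nearest neighbours

# SMOTE draws a random gap in [0, 1] and places each synthetic point
# on the segment between the two observed values
gap <- runif(5)
synthetic <- x_i + gap * (x_nn - x_i)

round(synthetic, 3)
```

Every synthetic value falls between the two observed levels, which is consistent with the low-frequency mass between the integer peaks in the histogram.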

Violin plot by category.

Code
ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  theme_minimal() +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Physical activity levels show a slight decline as obesity levels increase, particularly evident in the narrowing distributions and lower medians observed for Obesity Type II and Obesity Type III categories. In contrast, the Insufficient Weight and Normal Weight groups exhibit higher physical activity levels, as reflected by their broader and more symmetrical distributions.

The graph reveals a distinct trend: individuals in lower obesity categories engage in more physical activity compared to those in higher obesity categories. This trend suggests an inverse relationship between physical activity and obesity levels.

3.0.0.1.10 Water consumption

Plot histogram and density for water consumption.

Code
ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Consumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")

This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2 liters, indicating that most individuals consume around this amount. This aligns with scientific literature, which generally recommends an average daily water intake of about 2 liters for optimal health.

Scatterplot of weight vs water consumption, with a LOESS trend line.

Code
# Scatterplot with a LOESS trend line
ggplot(dataset, aes(x = weight, y = ch2o, color = obesity_lev)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "loess", se = FALSE, color = "black") +
    labs(title = "Scatterplot of Weight vs Water Consumption", x = "Weight", y = "Water Consumption (ch2o)") +
    theme_minimal() +
    coord_cartesian(xlim = c(35, 135))

The scatterplot visualizes the relationship between weight and water consumption (ch2o), categorized by obesity levels. The trend line reveals a slightly increasing pattern of water consumption as weight increases, though the relationship is relatively weak and mostly linear.

This pattern suggests that individuals with Insufficient Weight and Normal Weight categories generally report slightly lower water consumption compared to individuals in the higher weight categories, such as Obesity Type II and III. The increase in water consumption among higher weight groups could indicate attempts to adopt healthier habits or increased hydration needs due to larger body sizes. However, the relatively flat trend across most weight ranges suggests that water consumption does not vary dramatically across different weight categories, highlighting a potential area for targeted interventions to promote hydration as a component of healthy dietary behavior.

3.0.0.1.11 Technology utilization

Histogram with Density.

Code
ggplot(dataset, aes(x = use_tech)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black", alpha = 0.6) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Histogram and Density of Use of Technology", x = "Use of Technology", y = "Density") +
  theme_minimal()

Density of Use of Technology by Obesity Level.

Code
ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()

This density plot provides a perspective on the use of technology across different obesity levels. A striking feature is the sharp, dominant peak in Obesity Type III (yellow) around the value of 1. This pattern diverges notably from the smoother and more evenly distributed curves seen in other obesity categories, suggesting a unique behavioral trend in this group.

The peak indicates a strong clustering of individuals in Obesity Type III who report moderate use of technology, which may reflect consistent engagement with technology-based activities such as sedentary work, entertainment, or even health-monitoring applications. In contrast, other obesity categories, such as Obesity Type II and Overweight Level II, exhibit more balanced distributions without a single dominant peak, hinting at more varied technology usage patterns.

This observation raises interesting questions about the role of technology in shaping lifestyle behaviors in Obesity Type III individuals. It may point to a reliance on technology that correlates with a sedentary lifestyle, a known risk factor for obesity. Alternatively, it could reflect targeted interventions or habits specific to this group.

4 Analysis

The analysis phase is dedicated to the development, refinement, and comprehensive evaluation of the predictive models, meticulously designed to directly address the previously defined research questions.

4.1 Methods

The modeling process is structured to address the two key research questions:

  1. identifying the most significant lifestyle and behavioral factors contributing to obesity in Mexico, Peru, and Colombia;

  2. predicting whether a person will be obese based on a given combination of factors.

4.1.1 Linear Regression Model

A linear regression model will be developed to predict an individual’s BMI using weight and height as predictors, reflecting their foundational role in BMI calculation. As emphasized by Mendoza Palechor and De La Hoz Manotas (2019), these variables are fundamental to understanding body composition and are directly tied to the dataset’s variable of obesity levels. By focusing on BMI as a continuous outcome, this approach complements categorical classifications by capturing more detailed variations in body composition across the population. The model will be evaluated using standard regression metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R², ensuring its predictive accuracy and reliability while providing a robust foundation for public health applications.
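For reference, the three evaluation metrics named above have the following standard definitions, where \(y_i\) is the observed BMI of individual \(i\), \(\hat{y}_i\) the model's prediction, \(\bar{y}\) the mean observed BMI, and \(n\) the number of observations (the symbols are introduced here for illustration):

```latex
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
\]
```

RMSE is expressed in the same units as BMI (kg/m²), which makes it the easiest of the three to interpret directly.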

4.1.2 Logistic Regression Model

To address the key research questions, a logistic regression model will be employed to estimate the probability that an individual belongs to one of two categories: obese or not obese. Weight and height will be excluded as predictors because they are directly used to calculate BMI, which serves as the basis for the obesity levels categorized in the dataset. Including these variables would create a dependency between the predictors and the target variable, potentially biasing the analysis. By excluding weight and height, the focus shifts to behavioral and lifestyle factors, such as dietary habits, physical activity, and demographic characteristics, to better understand their influence on obesity risk.

While logistic regression provides a clear and interpretable framework for estimating probabilities, it inherently limits the analysis to a binary classification. This restriction prevents the exploration of the full spectrum of obesity levels, such as Obesity Type I, II, or III, as classified in the dataset. Despite this limitation, logistic regression is a robust method for quantifying the relationships between independent variables and the binary outcome. Feature selection techniques will ensure that only the most relevant predictors are retained, and the model’s performance will be rigorously evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, ensuring reliable and actionable insights.
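As a rough sketch of what this workflow could look like, the code below fits a binary logistic regression with `glm()` and computes the metrics listed above in base R. It runs on simulated data: the variable names mirror the report's, but the sample size, effect sizes, and the 0.5 decision threshold are assumptions chosen purely for illustration.

```r
# Hypothetical sketch on SIMULATED data; effect sizes are illustrative only.
set.seed(123)
n <- 500
sim <- data.frame(
  physical_act = runif(n, 0, 3),
  caloric_food = factor(sample(c("no", "yes"), n, replace = TRUE,
                               prob = c(0.12, 0.88)))
)

# Simulated outcome: lower activity and high-caloric food raise the odds
lin_pred  <- -0.5 - 0.6 * sim$physical_act + 1.0 * (sim$caloric_food == "yes")
sim$obese <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Binary logistic regression (logit link)
fit <- glm(obese ~ physical_act + caloric_food, data = sim, family = binomial)

# Predicted probabilities and class labels at an assumed 0.5 threshold
prob <- predict(fit, type = "response")
pred <- as.integer(prob >= 0.5)

# Confusion-matrix cells
tp <- sum(pred == 1 & sim$obese == 1)
fp <- sum(pred == 1 & sim$obese == 0)
fn <- sum(pred == 0 & sim$obese == 1)
tn <- sum(pred == 0 & sim$obese == 0)

accuracy  <- (tp + tn) / n
precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
f1        <- 2 * precision * recall / (precision + recall)

# ROC-AUC via the rank-sum (Mann-Whitney) formula
n1  <- sum(sim$obese == 1)
n0  <- n - n1
auc <- (sum(rank(prob)[sim$obese == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1, auc = auc), 3)
```

These metrics are computed on the same data used for fitting, so they are optimistic; a held-out test set or cross-validation would give a fairer estimate.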

4.1.3 Insights and Limitations

Regression analysis helps us understand how predictors influence outcomes, with logistic regression classifying individuals as obese or not obese and linear regression predicting BMI as a continuous variable. The dataset offers a mix of advantages and challenges: synthetic data ensures balanced representation but lacks the complexity of real-world patterns, while user-collected data adds variability but is prone to biases. Logistic regression simplifies the analysis by focusing on a binary outcome, leaving out the nuanced gradations of obesity, and assumes a linear relationship between the predictors and the log-odds of the outcome, which may not fully capture complex relationships. Linear regression relies on accurate weight and height data, making it sensitive to reporting errors. Despite these limitations, the models offer insights into obesity risk and body composition, serving as a valuable exercise and foundation for future projects, even if not directly applicable to real-world scenarios.

4.2 Goals for Each Method

4.2.1 Linear Regression Model Development

Data Loading and Processing

The dataset was imported, and initial exploration was conducted to understand its structure. BMI was calculated as a key variable, and missing values were addressed by removing incomplete rows. Boxplots were used to visualize the distributions of key variables, ensuring the dataset was ready for analysis.

Code
dataset_raw <- read.csv(here("data/raw/dataset_raw.csv"))
head(dataset_raw)
  Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female  21   1.62   64.0                            yes   no    2   3
2 Female  21   1.52   56.0                            yes   no    3   3
3   Male  23   1.80   77.0                            yes   no    2   3
4   Male  27   1.80   87.0                             no   no    3   3
5   Male  22   1.78   89.8                             no   no    2   1
6   Male  29   1.62   53.0                             no  yes    2   3
       CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
1 Sometimes    no    2  no   0   1         no Public_Transportation
2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
4 Sometimes    no    2  no   2   0 Frequently               Walking
5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
6 Sometimes    no    2  no   0   0  Sometimes            Automobile
           NObeyesdad
1       Normal_Weight
2       Normal_Weight
3       Normal_Weight
4  Overweight_Level_I
5 Overweight_Level_II
6       Normal_Weight
Code
summary(dataset_raw)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight     FAVC                FCVC      
 Length:2111                    Length:2111        Min.   :1.000  
 Class :character               Class :character   1st Qu.:2.000  
 Mode  :character               Mode  :character   Median :2.386  
                                                   Mean   :2.419  
                                                   3rd Qu.:3.000  
                                                   Max.   :3.000  
      NCP            CAEC              SMOKE                CH2O      
 Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
 1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
 Median :3.000   Mode  :character   Mode  :character   Median :2.000  
 Mean   :2.686                                         Mean   :2.008  
 3rd Qu.:3.000                                         3rd Qu.:2.477  
 Max.   :4.000                                         Max.   :3.000  
     SCC                 FAF              TUE             CALC          
 Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
 Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
                    Mean   :1.0103   Mean   :0.6579                     
                    3rd Qu.:1.6667   3rd Qu.:1.0000                     
                    Max.   :3.0000   Max.   :2.0000                     
    MTRANS           NObeyesdad       
 Length:2111        Length:2111       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
Code
dataset_raw$BMI <- dataset_raw$Weight / (dataset_raw$Height^2)
summary(dataset_raw$BMI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   24.33   28.72   29.70   36.02   50.81 
Code
if (sum(is.na(dataset_raw)) > 0) {
  cat("Missing values detected! Removing rows with NA.\n")
  dataset_raw <- na.omit(dataset_raw)
}
boxplot(dataset_raw$Weight, main = "Weight Distribution", col = "grey", border = "black", notch = TRUE, horizontal = TRUE, xlab = "Weight (kg)", ylim = c(30, 200))
grid(nx = NULL, ny = NULL, lty = 0.5, col = "black")

Code
boxplot(dataset_raw$Height, main = "Height Distribution", col = "lightgreen", border = "darkgreen", notch = TRUE, ylab = "Height (m)", ylim = c(1.4, 2))

Linear Regression Model Development

A linear regression model was built to examine the relationship between BMI, weight, and height. The model summary provided key performance metrics and insights into variable contributions.

Code
linear_model <- lm(BMI ~ Weight + Height, data = dataset_raw)
summary(linear_model)

Call:
lm(formula = BMI ~ Weight + Height, data = dataset_raw)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0892 -0.3809  0.1300  0.4007  2.4948 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.626e+01  3.455e-01   162.8   <2e-16 ***
Weight       3.403e-01  7.767e-04   438.1   <2e-16 ***
Height      -3.292e+01  2.180e-01  -151.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8282 on 2108 degrees of freedom
Multiple R-squared:  0.9893,    Adjusted R-squared:  0.9893 
F-statistic: 9.767e+04 on 2 and 2108 DF,  p-value: < 2.2e-16
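One way to read these coefficients, offered as an interpretive sketch rather than as part of the original analysis: BMI is defined as \(W / H^2\), so the linear model approximates a known nonlinear function. A first-order Taylor expansion around the sample means (\(W_0 \approx 86.59\) kg and \(H_0 \approx 1.702\) m, taken from the summary output above) gives:

```latex
\[
\mathrm{BMI} = \frac{W}{H^2}
\;\approx\;
\frac{W_0}{H_0^2}
+ \frac{1}{H_0^2}\,(W - W_0)
- \frac{2 W_0}{H_0^3}\,(H - H_0)
= \frac{2 W_0}{H_0^2} + \frac{1}{H_0^2}\,W - \frac{2 W_0}{H_0^3}\,H
\]
\[
\frac{1}{H_0^2} \approx 0.345,
\qquad
-\frac{2 W_0}{H_0^3} \approx -35.1,
\qquad
\frac{2 W_0}{H_0^2} \approx 59.8
\]
```

These values are close to the fitted estimates (0.340 for Weight, -32.9 for Height, and 56.3 for the intercept). This suggests that the very high R² largely reflects the deterministic definition of BMI, with the residual error coming from the curvature that a linear model cannot capture.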

Evaluation

The model was evaluated by generating predictions and calculating key performance metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²). These metrics assess the model’s accuracy and its ability to explain the variance in BMI.

Code
library(Metrics)
predictions <- predict(linear_model, newdata = dataset_raw)
head(predictions)
       1        2        3        4        5        6 
24.70397 25.27386 23.20182 26.60432 28.21541 20.96121 
Code
mse <- mean((dataset_raw$BMI - predictions)^2)
rmse <- rmse(dataset_raw$BMI, predictions)
r_squared <- summary(linear_model)$r.squared
cat("MSE:", mse, "\nRMSE:", rmse, "\nR²:", r_squared, "\n")
MSE: 0.6848888 
RMSE: 0.8275801 
R²: 0.9893238 
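The metrics above are computed on the same rows used to fit the model. A common complement is to evaluate on held-out data. The sketch below shows one way to do this with an 80/20 split; it uses simulated height and weight values (uniform draws over roughly the observed ranges, an assumption made for illustration), not the report's dataset.

```r
# Hypothetical hold-out evaluation on SIMULATED data
set.seed(1)
n <- 1000
sim <- data.frame(
  Height = runif(n, 1.45, 1.98),  # metres, roughly the observed range
  Weight = runif(n, 40, 170)      # kg, roughly the observed range
)
sim$BMI <- sim$Weight / sim$Height^2

# 80/20 train/test split
idx   <- sample(seq_len(n), size = 0.8 * n)
train <- sim[idx, ]
test  <- sim[-idx, ]

# Fit on the training rows only, predict on the unseen test rows
fit  <- lm(BMI ~ Weight + Height, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample metrics
mse_test  <- mean((test$BMI - pred)^2)
rmse_test <- sqrt(mse_test)
r2_test   <- 1 - sum((test$BMI - pred)^2) / sum((test$BMI - mean(test$BMI))^2)

round(c(MSE = mse_test, RMSE = rmse_test, R2 = r2_test), 4)
```

If the out-of-sample metrics stay close to the in-sample ones, the model is not overfitting; a large gap would point to the opposite.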

Results Visualization

A scatter plot was created to compare actual BMI values with predicted values. A reference line (ideal fit) was added to assess the alignment of predictions with true values. The plot provides a visual representation of the model’s accuracy.

Code
plot(dataset_raw$BMI, predictions, xlab = "Actual BMI", ylab = "Predicted BMI", main = "Actual vs Predicted BMI", pch = 16, col = "blue", cex = 0.6)
abline(0, 1, col = "red", lwd = 2)
legend("topleft", legend = c("Predicted Values", "Ideal Fit"), col = c("blue", "red"), pch = c(16, NA), lty = c(NA, 1), bty = "n")

Diagnostics

Diagnostic plots were generated to check the assumptions of the linear regression model: the absence of patterns in the residuals, their normality, and constant variance (homoscedasticity). A histogram of the residuals was also created to assess their distribution, with a vertical reference line at zero.

Code
par(mfrow = c(2, 2), mar = c(4, 4, 2.5, 2), cex.main = 1.3, cex.lab = 1, cex.axis = 1)
plot(linear_model, col = "blue", pch = 19, cex = 0.2, lwd = 2)

Code
hist(residuals(linear_model), col = "gray", border = "white", main = "Distribution of Residuals", xlab = "Residuals", ylab = "Frequency", breaks = 15, cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2)
abline(v = 0, col = "red", lwd = 2, lty = 2)

4.2.2 Logistic Regression Model Development

4.3 Results

4.3.1 Conclusion

So far, we have explored and prepared our dataset with the aim of understanding how lifestyle factors influence obesity in a sample from Mexico, Peru, and Colombia. The dataset was pre-processed with SMOTE to address class imbalance, which gave us balanced obesity categories and supported an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables most strongly associated with obesity levels, which guides our selection of factors for the next modeling phase. We also cleaned and structured the data by renaming variables for clarity, formatting categorical variables, and removing duplicates, so that the modeling rests on a solid foundation.

The next steps involve constructing regression models to analyze how these selected factors relate to obesity levels and how well they predict them. Specifically, we will develop two versions of the model, one that includes extreme values and one that excludes them, to evaluate the impact of outliers on model accuracy and stability. R², p-values, and the variance inflation factor (VIF) will be used to assess the reliability of the model and to detect potential multicollinearity. Following this, we will build and fine-tune a predictive model, using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and improve its performance.
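
As a rough illustration of the outlier comparison and multicollinearity check planned above, the sketch below fits the same model with and without extreme values and computes VIFs by hand. It is not the report's implementation. The simulated data frame `sim`, the 1.5 × IQR rule for flagging extreme values, and the helper `vif_manual` are all hypothetical choices made for this example.

```r
# Simulated stand-in for the real data (hypothetical values, not dataset_raw).
set.seed(1)
n <- 300
sim <- data.frame(Height = runif(n, 1.5, 1.9))
# Weight depends on height, so the two predictors are correlated.
sim$Weight <- 60 * sim$Height - 20 + rnorm(n, sd = 12)
sim$BMI <- sim$Weight / sim$Height^2

# Flag extreme BMI values with the 1.5 * IQR rule (an assumed definition of "extreme").
q <- unname(quantile(sim$BMI, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
keep <- sim$BMI >= q[1] - fence & sim$BMI <= q[2] + fence

# Version 1 keeps every row; version 2 drops the flagged rows.
fit_all <- lm(BMI ~ Weight + Height, data = sim)
fit_trimmed <- lm(BMI ~ Weight + Height, data = sim[keep, ])

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_manual(sim, c("Weight", "Height"))

cat("Rows kept:", sum(keep), "of", n, "\n")
cat("R-squared (all rows):", summary(fit_all)$r.squared, "\n")
cat("R-squared (trimmed):", summary(fit_trimmed)$r.squared, "\n")
print(vifs)
```

With exactly two predictors the two VIFs are identical, since each equals 1 / (1 − r²) for the same pairwise correlation r. The check becomes more informative once further lifestyle variables are added to the model.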

These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.

4.4 Next Steps


4.5 Final Thoughts
